Abstract: Smart devices with built-in sensors, computational
capabilities, and network connectivity have become increasingly
pervasive. The crowds of smart devices offer opportunities to
collectively sense and perform computing tasks at an unprecedented
scale. This paper presents Crowd-ML, a privacy-preserving
machine learning framework for a crowd of smart devices, which 
can solve a wide range of learning problems for crowdsensing 
data with differential privacy guarantees. Crowd-ML endows 
a crowdsensing system with an ability to learn classifiers or 
predictors online from crowdsensing data privately with minimal 
computational overheads on devices and servers, suitable for
a practical and large-scale deployment of the framework. We
analyze the performance and the scalability of Crowd-ML, and 
implement the system with off-the-shelf smartphones as a proof 
of concept. We demonstrate the advantages of Crowd-ML with 
real and simulated experiments under various conditions. 

I. Introduction

A. Crowdsensing 

Smart devices are increasingly pervasive in daily life. 
These devices are characterized by their built-in sensors (e.g., 
accelerometers, cameras, and microphones), programmable
computation ability, and Internet connectivity via wireless or 
cellular networks. These include stationary devices such as 
smart thermostats and mobile devices such as smartphones 
or in-vehicle systems. More and more devices are also being 
interconnected, often referred to as the “Internet of Things.” 
Inter-connectivity offers opportunities for crowds of smart
devices to collectively sense and compute at an unprecedented
scale. Various applications of crowdsensing have been proposed,
including personal health/fitness monitoring, environmental
sensing, and monitoring road/traffic conditions (see
Section II-A), and the list is currently expanding.

Crowdsensing is used primarily for collecting and analyzing 
aggregate data from a population of participants. However, 
more complex and useful tasks can be performed beyond 
calculation of aggregate statistics, by using machine learning 
algorithms on crowdsensing data. Examples of such tasks 
include: learning optimal settings of room temperatures for 
smart thermostats; predicting user activity for context-aware 
services and physical monitoring; suggesting the best driving
routes; and recognizing audio events from microphone sensors.
The specific algorithms and data types for these tasks differ,
but they can all be trained in standard unsupervised or supervised
learning settings: given sensory features (time, location,
motion, environmental measures, etc.), train an algorithm
or model that can accurately predict a variable of interest 
(temperature setting, current user activity, amount of traffic, 
audio events, etc.). Conventionally, crowdsensing and machine 
learning are performed as two separate processes: devices 
passively collect and send data to a central location, and 
analyses or learning procedures are performed at the remote 
location. However, current generations of smart devices have 
computing capabilities in addition to sensing. In this paper, 
we propose to utilize computing capabilities of smart devices, 
and integrate sensing and learning processes together into a 
crowdsensing system. As we will show, the integration allows 
us to design a system with better privacy and scalability. 

B. Privacy 

Privacy is an important issue for crowdsensing applications. 
By assuring participants’ privacy, a crowdsensing system 
can appeal to a larger population of potential participants, 
which increases the utility of such a system. However, many 
crowdsensing systems in the literature do not employ any 
privacy-preserving mechanism (see Section II-B), and existing
mechanisms used in crowdsensing (see [1]) are often difficult
to compare qualitatively across different systems or data types.
In the last decade, differential privacy has gained popularity 
as a formal and quantifiable measure of privacy risk in data 
publishing [2]–[4]. Briefly, differential privacy measures how
much the outcome of a procedure changes probabilistically with the
presence or absence of any single subject in the original data. The
measure provides an upper bound on privacy loss regardless 
of the content of data or any prior knowledge an adversary 
might have. While differential privacy has been applied in
data publishing and machine learning (see Section II-B),
it has not been broadly adopted in crowdsensing systems.
In this paper, we integrate differentially private mechanisms
into the crowdsensing system as well, which can provide
strong protection against various types of possible attacks (see
Section III-C).

C. Proposed work 

This paper presents Crowd-ML, a privacy-preserving machine
learning framework for a crowdsensing system that consists
of a server and smart devices (see Fig. 1). Crowd-ML
is a distributed learning framework that integrates sensing,











Fig. 1: Crowd-ML consists of a server and a number of smart 
devices. The system integrates sensing, learning, and privacy 
mechanisms together, to learn a classifier or predictor from 
device-generated data in an online and distributed way, with 
formal privacy guarantees. 


learning, and privacy mechanisms together, and can build 
classifiers or predictors of interest from crowdsensing data 
using computing capability of devices with formal privacy 
guarantees. 

Algorithmically, Crowd-ML learns a classifier or predictor
by distributed incremental optimization. Optimal parameters
of a classifier or predictor are found by minimizing the risk
function associated with a given task (see Section III
for details). Specifically, the framework finds optimal parameters
by incrementally minimizing the risk function using a
variant of stochastic (sub)gradient descent (SGD) [5]. Unlike
batch learning, SGD requires only the gradient information
to be communicated between devices and a server, which
has two important consequences: 1) computation load can be
distributed among the devices, enhancing scalability of the system;
2) private data of the devices need not be communicated
directly, enhancing privacy. By exploiting these two properties,
Crowd-ML efficiently learns a classifier or predictor from a
crowd of devices, with a guarantee of ε-differential privacy.
The differential privacy mechanism is applied locally on each
device, using Laplace noise for the gradients and exponential
mechanisms for other information (see Section III-C).

We show advantages of Crowd-ML by analyzing its scalability
and privacy-performance trade-offs (Section IV), and by
testing the framework with demonstrative tasks implemented
on Android smartphones and in simulated environments under
various conditions (see Section V).

In summary, we make the following contributions: 

• We present Crowd-ML, a general framework for machine 
learning with smart devices from crowdsensing data, with 
many potential applications. 

• We show differential privacy guarantees of Crowd-ML 
that provide a strong privacy mechanism against various 
types of attacks in crowdsensing. To the best of our 
knowledge, Crowd-ML is the first general framework 
that integrates sensing, learning, and differentially private 
mechanisms for crowdsensing. 

• We analyze the framework to show that the cost of
privacy preservation can be minimized and that the computational
and communication overheads on devices are
only moderate, allowing a large-scale deployment of the
framework.

• We implement a prototype and evaluate the framework 
with a demonstrative task in a real environment as well 
as large-scale experiments in a simulated environment. 


The remainder of this paper is organized as follows. We
first review related work in Section II. Section III describes
the Crowd-ML framework. Section IV analyzes Crowd-ML
in terms of privacy-performance trade-off, computation, and
communication loads. Section V presents an implementation
of Crowd-ML and experimental evaluations. We discuss remaining
issues and conclude in Section VI.


II. Related Work

Crowd-ML integrates distributed learning algorithms and 
differential privacy mechanisms into a crowdsensing system. 
In this section, we review related work in crowdsensing and 
learning systems, and privacy-preserving mechanisms. 


A. Crowdsensing and learning 

There is a vast amount of work in crowdsensing, and we
focus on the system aspect of previous work with representative
papers (we refer the reader to survey papers [6] and
[7]). Crowdsensing systems aim to achieve mass collection and
mining of environmental and human-centric data such as social
interactions, political issues of interest, exercise patterns, and
people’s impact on the environment [8]. Examples of such
systems include Micro-Blog [9], PoolView [10], BikeNet [11],
and PEIR [12]. Data collected by crowdsensing can also be
used to mine high-level patterns or to predict variables of interest
using machine learning. Applications of learning applied
to crowdsensing include learning of bus waiting times [13] and
recognizing user activities (see [14] for a review). Jigsaw [15]
and Lifestreams [16] also use pattern recognition in sensed
data from mobile devices. From the system perspective, these
works use devices to passively sense and send data to a central
server on which analyses take place, which we will refer to as
the centralized approach. In contrast, sensing and learning can
be performed purely inside each device without a server, which
we call the decentralized approach. For example, SoundSense
[17] learns a classifier on a smartphone to recognize various
audio events without communicating with the back-end. Mixed
centralized and decentralized approaches are also used in
[?], where a portion of computation is performed off-line on
a server. CQue [18] provides a query interface for privacy-aware
probabilistic learning of users’ contexts, and ACE [19]
uses static association rules to learn users’ contexts. System-wise,
our work differs from those centralized or decentralized
approaches in that we use a distributed approach to perform
learning by devices and server together, which improves
privacy and scalability of the system. We are not aware of
any other crowdsensing system that takes a similar approach.
Also, the cited papers are oriented towards novel applications,
but our work focuses on a general framework for learning a
wide range of algorithms and applications.

Crowd-ML also builds on recent advances in incremental
distributed learning [20], [21], which show that a near-optimal
convergence rate is achievable despite communication delays.
A privacy-preserving stochastic gradient descent method is
presented briefly in [22]. Unlike the latter, we present a
complete framework for privacy-preserving multi-device learning,
with performance analysis and demonstrations in real
environments.

B. Privacy-preserving mechanisms 

Privacy is an important issue in data collection and analysis.
In particular, preserving privacy of users’ locations has
been studied by many researchers (see [23] for a survey).
To preserve privacy of general data types formally, several
mechanisms such as k-anonymity [24] and secure multiparty
computation [25] have been proposed, for data publishing
[26] and also for participatory sensing [1]. Recently, differential
privacy [2]–[4] has addressed several weaknesses of
k-anonymity [27], and gained popularity as a quantifiable
measure of privacy risk. Differential privacy has been used for
a privacy-preserving data analysis platform [28], for sanitization
of learned model parameters from data [29], and for privacy-preserving
data mining from distributed time-series data [30].
So far, formal and general privacy mechanisms have not been
adopted broadly in crowdsensing. Among the crowdsensing
systems cited in the previous section ([9]–[13], [15]–[19],
[?], [31]), only [10], [?], [?] provide privacy mechanisms,
of which only [10] addresses privacy more formally. To the
best of our knowledge, Crowd-ML is the first framework to provide
formal privacy guarantees in general crowd-based learning
with smart devices.


III. Crowd-ML 

In this section, we describe Crowd-ML in detail: its system,
algorithms, and privacy mechanisms.

A. System and workflow 

The Crowd-ML system consists of a server and multiple
smart devices that are capable of sensory data collection,
numerical computation, and communication over a public
network with the server (see Fig. 1). The goal of Crowd-ML
is to learn a classifier or predictor of interest from
crowdsensing data collectively by multiple devices. A wide
range of classifiers or predictors can be learned by minimizing
an empirical risk associated with a given task, a common
method in statistical learning [32]. Formally, let $x \in \mathbb{R}^D$ be a
feature vector from preprocessing sensory input such as audio,
video, accelerometer, etc., and $y$ be a target variable we aim
to predict from $x$, such as user activity. For regression, $y$ can
be a real number, and for classification, $y$ is a discrete label
$y \in \{1, \ldots, C\}$ with $C$ classes. We define data as $N$ pairs
of (feature vector, target variable) generated i.i.d. from an
unknown distribution by all participating devices up to present:

$$\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}. \tag{1}$$


Suppose we use a classifier/predictor $h(x; w)$ with a tunable
parameter vector $w$, and a loss function $l(h(x; w), y)$ to measure
the performance of the classifier/predictor with respect
to the true target $y$. A wide range of learning algorithms can
be represented by $h$ and $l$, e.g., regression, logistic regression,
and Support Vector Machine (see [32] for more examples). If
there are $M$ smart devices, we find the optimal parameters
$w$ of the classifier/predictor by minimizing the empirical risk
over all $M$ devices:

$$R(w) = \frac{1}{N} \sum_{m=1}^{M} \sum_{(x, y) \in \mathcal{D}_m} l(h(x; w), y) + \frac{\lambda}{2} \|w\|^2, \tag{2}$$

where $\mathcal{D}_m$ is the set of samples generated from device $m$ only,
and $\frac{\lambda}{2} \|w\|^2$ is a regularization term. This risk function $R$ can

be minimized by many optimization methods. In this work
we use stochastic (sub)gradient descent (SGD) [33], which is
one of the simplest optimization methods and is also suitable
for large-scale learning [34]. SGD minimizes the risk by
updating $w$ sequentially,

$$w(t + 1) \leftarrow \Pi_{\mathcal{W}} \left[ w(t) - \eta(t)\, g(t) \right], \tag{3}$$

where $\eta(t)$ is the learning rate, and $g(t)$ is the gradient of the
loss function

$$g = \nabla_w l(h(x; w), y), \tag{4}$$

evaluated with the sample $(x, y)$ and the current parameter
$w(t)$. We assume the parameter domain $\mathcal{W}$ is a $d$-dimensional
ball of some large radius $R$, and the projection is $\Pi_{\mathcal{W}}(w) =
\min(1, R / \|w\|)\, w$. By default, we use the learning rate

(5)


where $c$ is a constant hyperparameter. When computing gradients,
we use a ‘minibatch’ of $b$ samples to compute the
averaged gradient

$$\tilde{g} = \frac{1}{b} \sum_{i=1}^{b} \nabla_w l(h(x_i; w), y_i), \tag{6}$$

which plays an important role in the performance-privacy
trade-off and the scalability (Section IV). In Crowd-ML, risk
minimization by SGD is performed by distributing the main
workload (computation of averaged gradients) to $M$ devices.
Note that each device generates data and computes gradients
using its own data. The workflow is described in Fig. 2.
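The projected update of Eq. (3) and the minibatch average of Eq. (6) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the default radius, and the `0.5 / sqrt(t)` schedule in `example_rate` are assumptions made for the example, and any learning rate schedule can be passed in instead.

```python
import math

def project_to_ball(w, radius):
    """Return min(1, R / ||w||) * w, the projection onto an L2 ball of radius R."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm <= radius:
        return list(w)
    return [radius / norm * v for v in w]

def averaged_gradient(per_sample_grads):
    """Average b per-sample gradient vectors (the minibatch gradient)."""
    b = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[j] for g in per_sample_grads) / b for j in range(dim)]

def sgd_step(w, grad, t, learning_rate, radius=1e3):
    """One update: w(t+1) = Proj[w(t) - eta(t) * g(t)]."""
    eta = learning_rate(t)
    stepped = [wi - eta * gi for wi, gi in zip(w, grad)]
    return project_to_ball(stepped, radius)

# Hypothetical schedule, used only for illustration.
def example_rate(t):
    return 0.5 / math.sqrt(t)
```

A call such as `sgd_step(w, averaged_gradient(grads), t, example_rate)` performs one server-side update; the projection only takes effect when the stepped parameter leaves the ball.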


B. Algorithms 

Crowd-ML algorithms are presented in Algorithms 1 and 2.
Device Routine 1 collects samples. When the number of
samples reaches the minibatch size $b$, the routine tries to
check out the current model parameters $w$ from the server
and calls Device Routine 2. Device Routine 2 computes the
averaged gradient from the stored samples and $w$ received
from the server, sanitizes information by Device Routine 3,
and sends the sanitized information to the server. Device
Routine 3 uses Laplace noise and exponential mechanisms
(in the next section) to sanitize the averaged gradient $\tilde{g}$, the


Fig. 2: Crowd-ML workflow. 1. A device preprocesses sensory
data and generates one or more samples. 2. When the number of
samples {(x, y)} exceeds a certain number, the device requests
current model parameters w from the server. 3. The server
authenticates the device and sends w. 4. Using w and {(x, y)},
the device computes the gradient g and sends it to the server
using privacy mechanisms. 5. The server receives the gradient
g and updates w. While one device is performing routines 1-5,
other devices are allowed to perform the same routines
asynchronously. Devices can join or leave the task at any time.


number of misclassified samples $n_e$, and the label counts
$n_y^k$. Device Routines 1-3 are performed independently and
asynchronously by multiple devices.
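As a concrete illustration of Device Routine 2, the sketch below computes the averaged gradient, the misclassification count, and the label counts for multiclass logistic regression (the model of Table I). It is a simplified sketch under stated assumptions: labels are 0-indexed integers rather than 1, ..., C, the weights are stored as a list of C vectors, and the names `softmax` and `device_routine_2` are hypothetical. Sanitization (Device Routine 3) is deliberately left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def device_routine_2(w, samples, lam=0.0):
    """Compute (averaged gradient, n_s, n_e, n_y) from buffered samples.

    w       : list of C weight vectors, each of length D
    samples : non-empty list of (x, y) pairs, y in {0, ..., C-1}
    lam     : regularization weight lambda
    """
    C, D = len(w), len(w[0])
    grad_sum = [[0.0] * D for _ in range(C)]
    n_e = 0                 # number of misclassified samples
    n_y = [0] * C           # label counts
    for x, y in samples:
        scores = [sum(wkj * xj for wkj, xj in zip(wk, x)) for wk in w]
        probs = softmax(scores)
        y_pred = max(range(C), key=lambda k: scores[k])
        n_y[y] += 1
        n_e += int(y_pred != y)
        for k in range(C):
            # Per-sample gradient for class k: x * (P(y=k|x) - I[y=k])
            coeff = probs[k] - (1.0 if k == y else 0.0)
            for j in range(D):
                grad_sum[k][j] += coeff * x[j]
    n_s = len(samples)
    avg = [[grad_sum[k][j] / n_s + lam * w[k][j] for j in range(D)]
           for k in range(C)]
    return avg, n_s, n_e, n_y
```

In the full routine, the returned gradient and counts would be perturbed on the device before anything is sent to the server.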

Server Routine 1 sends out current parameters $w$ when
requested and Server Routine 2 receives check-ins ($\hat{g}$, $n_s$,
$\hat{n}_e$, $\hat{n}_y^k$) from devices when requested. The whole procedure
ends when the total number of iterations exceeds a maximum
value $T_{\max}$, or the overall error is below a threshold $\rho$.
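The server side can be pictured with a small in-memory stand-in for Server Routines 1 and 2. This sketch assumes no networking or authentication, omits the per-class label counts, and omits the projection step; the class name `CrowdServer` and its method names are hypothetical. The iteration counter is incremented before the learning rate is evaluated so that schedules defined for t >= 1 can be used, which is a choice made for this example.

```python
class CrowdServer:
    """Toy in-memory version of the check-out / check-in loop."""

    def __init__(self, w0, learning_rate, t_max, target_error):
        self.w = list(w0)
        self.learning_rate = learning_rate   # callable: t -> eta(t)
        self.t_max = t_max
        self.target_error = target_error
        self.t = 0
        self.total_samples = 0
        self.total_errors = 0

    def stopped(self):
        """Stop when t exceeds t_max or the overall error is below the target."""
        if self.t > self.t_max:
            return True
        if self.total_samples == 0:
            return False
        return self.total_errors / self.total_samples < self.target_error

    def checkout(self):
        """Server Routine 1: hand out a copy of the current parameters."""
        return list(self.w)

    def checkin(self, g_hat, n_s, n_e_hat):
        """Server Routine 2: accumulate counts and apply one gradient update."""
        self.total_samples += n_s
        self.total_errors += n_e_hat
        self.t += 1
        eta = self.learning_rate(self.t)
        self.w = [wi - eta * gi for wi, gi in zip(self.w, g_hat)]
```

Because devices check in asynchronously, the parameters returned by `checkout` may already be stale by the time the matching `checkin` arrives; the sketch applies the update regardless.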

Remark 1: In Device Routine 1, if check-out fails, the 
device keeps collecting samples and retries check-out later. 
A prolonged period of network outage for a device can make 
the parameter outdated for the device, but it does not affect 
the overall learning critically. Similarly, failure to check in
information with the server in Device Routine 2 is non-critical.

Remark 2: In Device Routine 2, we can randomly set aside
a small portion of samples as test data. In this case, the
misclassification error is computed only from these held-out
samples, and their gradients will not be used in the average $\tilde{g}$.

Remark 3: In Server Routine 2, more recent update methods
[35], [36] can be used in place of the simple update rule
(3) without affecting differential privacy or changing device
routines. Similarly, adaptive learning rates [37], [38] can be
used in place of (5), which can provide robustness to large
gradients from outlying or malignant devices.

C. Privacy mechanism 

In crowdsensing systems, private data of users can be leaked
in many ways. System administrators/analysts can violate the
privacy intentionally, or they may leak private information 
unintentionally when publishing data analytics. There are also 
more hostile types of attacks: by malignant devices posing 


Algorithm 1 Device side

Input: privacy levels $\epsilon_g$, $\epsilon_e$, $\epsilon_{y^k}$, minibatch size $b$, max buffer
size $B$, classifier model ($C$, $h$, $l$, $\lambda$ from Eq. (2))
Init: set $n_s = 0$, $n_e = 0$, $n_y^k = 0$, $k = 1, \ldots, C$
Communication to server: $\hat{g}$, $n_s$, $\hat{n}_e$, $\hat{n}_y^k$
Communication from server: $w$

Device Routine 1
  if $n_s > B$ then
    stop collection to prevent resource outage
  else
    receive a sample $(x, y)$ (at a regular interval or triggered
    by events), and add it to the secure local buffer
    $n_s = n_s + 1$
  end if
  if $n_s \ge b$ then
    check out $w$ from the server via HTTPS
    call Device Routine 2
  end if

Device Routine 2
  Using $w$ from the server and $\{(x, y)\}$ from the local buffer,
  for $i = 1, \ldots, n_s$ do
    make a prediction $y_i^{\mathrm{pred}} = h(x_i; w)$
    $n_y^{y_i} = n_y^{y_i} + 1$
    $n_e = n_e + I[y_i^{\mathrm{pred}} \ne y_i]$
    incur a loss $l(h(x_i; w), y_i)$
    compute a subgradient $g_i = \nabla_w l(h(x_i; w), y_i)$
  end for
  compute the average $\tilde{g} = \frac{1}{n_s} \sum_i g_i + \lambda w$
  sanitize data with Device Routine 3
  check in $\hat{g}$, $n_s$, $\hat{n}_e$, $\hat{n}_y^k$, $k = 1, \ldots, C$ with the server via HTTPS
  reset $n_s = 0$, $n_e = 0$, $n_y^k = 0$, $k = 1, \ldots, C$

Device Routine 3
  sample $\hat{g} = \tilde{g} + z$ from Eq. (10)
  sample $\hat{n}_e = n_e + z$ from Eq. (11)
  sample $\hat{n}_y^k = n_y^k + z$, $k = 1, \ldots, C$ from Eq. (12)

as legitimate devices, by hackers poaching data stored on the
server or eavesdropping on communication between devices
and servers. Instead of preserving privacy separately for each
attack type, we can preserve privacy from all these attacks
by a local privacy-preserving mechanism that is implemented
on each device and sanitizes any information before it leaves
the device. A local mechanism assumes that an adversary
can potentially access all communication between devices and
the server, which subsumes the other attack types. This is
because the other forms of data that are 1) visible to malignant
devices, 2) stored in the server, or 3) released in public, are all
derived from what is communicated between devices and the
server. We adopt local ε-differential privacy as a quantifiable
measure of privacy in Crowd-ML. Formally, a (randomized)
algorithm which takes data $\mathcal{D}$ as input and outputs $f$, is called

























TABLE I: Multiclass logistic regression 


Algorithm 2 Server side

Input: number of devices $M$, learning rate schedule $\eta(t)$, $t = 1, 2, \ldots, T_{\max}$,
desired error $\rho$, classifier model ($C$, $h$, $l$, $\lambda$ from Eq. (2))
Init: $t = 0$, randomized $w$, $N_s^m = 0$, $N_e^m = 0$, $N_y^{k,m} = 0$, $m = 1, \ldots, M$,
$k = 1, \ldots, C$
Stopping criteria: $t > T_{\max}$ or overall error $< \rho$

Server Routine 1
  while stopping criteria not met do
    listen to and accept check-out requests
    authenticate device
    send current parameters $w$ to device
  end while

Server Routine 2
  while stopping criteria not met do
    listen to and accept check-in requests
    authenticate device (suppose it is device $m$)
    receive $\hat{g}$, $n_s$, $\hat{n}_e$, $\hat{n}_y^k$, $k = 1, \ldots, C$
    $N_s^m = N_s^m + n_s$
    $N_e^m = N_e^m + \hat{n}_e$
    $N_y^{k,m} = N_y^{k,m} + \hat{n}_y^k$
    $w = w - \eta(t)\, \hat{g}$
    $t = t + 1$
  end while


ε-differentially private if

$$\frac{P(f(\mathcal{D}) \in S)}{P(f(\mathcal{D}') \in S)} \le e^{\epsilon} \tag{7}$$


for all measurable $S \subset \mathcal{T}$ of the output range, and for all
data sets $\mathcal{D}$ and $\mathcal{D}'$ differing in a single item. That is, even
if an adversary has the whole data $\mathcal{D}$ except a single item,
it cannot infer much more about that item from the output
of the algorithm $f$. A smaller ε makes such an inference
more difficult, and therefore makes the algorithm more privacy-preserving.
When the algorithm outputs a real-valued vector
$f \in \mathbb{R}^D$, its global sensitivity can be defined by

$$S(f) = \max_{\mathcal{D}, \mathcal{D}'} \|f(\mathcal{D}) - f(\mathcal{D}')\|_1, \tag{8}$$

where $\|\cdot\|_1$ is the $L_1$ norm. A basic result from the definition
of differential privacy is that a vector-valued function $f$ with
sensitivity $S(f)$ can be made ε-differentially private [2] by
adding an independent Laplace noise vector $z$ with density

$$P(z) \propto e^{-\frac{\epsilon}{S(f)} \|z\|_1}. \tag{9}$$
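The Laplace mechanism described above can be sketched with the Python standard library alone. The helper names are hypothetical, and the sketch assumes the caller already knows the L1 sensitivity of the quantity being released; each coordinate receives independent Laplace noise with scale equal to sensitivity divided by ε.

```python
import random

def sample_laplace(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two iid exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(values, sensitivity, epsilon, rng=None):
    """Add independent Laplace noise of scale sensitivity/epsilon to each value."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [v + sample_laplace(scale, rng) for v in values]
```

A smaller ε gives a larger noise scale, which is the privacy-performance trade-off discussed in the analysis section.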


^ As a variant, (ε, δ)-differential privacy can be achieved by adding Gaussian
noise.

^ The communication from the server to devices $\{w(t)\}$ can be reconstructed
from $\{\hat{g}(t)\}$, and therefore is redundant to consider.

Prediction: $\arg\max_k w_k^\top x$
Risk: $\frac{1}{N} \sum_i \left[ -w_{y_i}^\top x_i + \log \sum_l e^{w_l^\top x_i} \right] + \frac{\lambda}{2} \sum_k \|w_k\|^2$
Gradient: $\frac{1}{N} \sum_i x_i \left[ -I[y_i = k] + P(y = k \mid x_i) \right] + \lambda w_k$

In Crowd-ML, we consider ε-differential privacy of any single
(feature, label)-sample, revealed by communications from all
devices to the server, which are the gradients $\hat{g}$, the numbers
of samples $n_s$, the numbers of misclassified samples $n_e$, and
the label counts $n_y^k$. The amount of noise required depends
on the choice of loss functions. We compute this value for
multiclass logistic regression (Table I), but it can be computed
similarly for other loss functions as well. By adding element-wise
independent Laplace noise $z$ to averaged gradients $\tilde{g}$,

$$\hat{g} = \tilde{g} + z, \quad P(z) \propto e^{-\frac{\epsilon_g b}{4} |z|}, \tag{10}$$


we have the following privacy guarantee: 

Theorem 1 (Averaged gradient perturbation). The transmission
of $\hat{g}$ by Eq. (10) is $\epsilon_g$-differentially private.

See Appendix for proof. 

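To sanitize $n_e$ and $n_y^k$, we add ‘discrete’ Laplace noise [39]
as follows:

$$\hat{n}_e = n_e + z, \quad P(z) \propto e^{-\frac{\epsilon_e}{2} |z|}, \tag{11}$$

$$\hat{n}_y^k = n_y^k + z, \quad P(z) \propto e^{-\frac{\epsilon_{y^k}}{2} |z|}, \tag{12}$$

where $z = 0, \pm 1, \pm 2, \ldots$. These mechanisms have the following
privacy guarantees:


Theorem 2 (Error and label counts). The transmission of $\hat{n}_e$
and $\hat{n}_y^k$ by Eqs. (11) and (12) is $\epsilon_e$- and $\epsilon_{y^k}$-differentially
private, respectively.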


See Appendix for proof. 
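Integer-valued ("discrete") Laplace noise of the kind used for the error and label counts can be drawn as the difference of two independent geometric variables. The sketch below takes the decay rate `alpha` in P(z) ∝ exp(-alpha |z|) as a parameter and leaves its relation to the privacy level to the mechanism definition; the function names are hypothetical.

```python
import math
import random

def sample_geometric(q, rng):
    """Number of failures before the first success; P(G = k) = (1 - q) * q**k."""
    u = 1.0 - rng.random()            # uniform on (0, 1]
    return int(math.floor(math.log(u) / math.log(q)))

def sample_discrete_laplace(alpha, rng):
    """Draw an integer z with P(z) proportional to exp(-alpha * |z|)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    q = math.exp(-alpha)
    # The difference of two iid geometric variables is two-sided geometric.
    return sample_geometric(q, rng) - sample_geometric(q, rng)

def sanitize_count(count, alpha, rng=None):
    """Return the count plus discrete Laplace noise (the result may be negative)."""
    rng = rng or random.Random()
    return count + sample_discrete_laplace(alpha, rng)
```

Since the noise is symmetric around zero, sums of many sanitized counts remain unbiased estimates of the true totals, although an individual sanitized count can fall below zero.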

Practically, a system administrator chooses ε depending on
the desired level of privacy for the data collected. A small
ε (→ 0) may be used for data that users deem highly private
such as current location, and a large ε (→ ∞) may be used for
less private data such as ambient temperature.


IV. Analysis 

In this section, we analyze the privacy-performance trade-off
and the scalability of Crowd-ML. As discussed in Related
Work, most existing crowdsensing systems use purely centralized
or purely decentralized approaches, while Crowd-ML
uses a distributed approach. By design, Crowd-ML achieves
differential privacy with little loss of performance ($O(1/b)$),
only moderate computation load due to its simple optimization
method, and reduced communication load and delay ($O(1/b)$),
where $b$ is the minibatch size.


A. Privacy Performance 

Privacy costs performance: the more private we make the
system, the less accurate the outcome of analysis/learning
is. From Theorem 1, Crowd-ML is ε-differentially private
by perturbing averaged gradients. The centralized approach
can also be made ε-differentially private by feature and
label perturbation (see Appendix). Below we compare the
impact of privacy on performance between the centralized
approach and Crowd-ML. The performance of SGD-based learning
can be represented by its rate of convergence to the optimal
value/parameters, $E[l(w(t)) - l(w^*)]$ at iteration $t$, which in turn
depends on the properties of the loss $l(\cdot)$ (such as Lipschitz
continuity and strong convexity) and the step size $\eta(t)$, with
the best known rate being $O(1/t)$ [40]. When other conditions
are the same, the convergence rate is roughly proportional
to the amount of noise in the estimated gradient,
$E[l(w(t)) - l(w^*)] \propto G^2$ with $G^2 = \sup_t E[\|\hat{g}(t)\|^2]$ [41]. For Crowd-ML,
we have from Eq. (10)

$$E[\|\hat{g}\|^2] = E[\|\tilde{g}\|^2] + E[\|z\|^2] = \frac{1}{b} E[\|g\|^2] + \frac{32 D}{(b \epsilon_g)^2}, \tag{13}$$

where the first term is the amount of noise due to sampling,
and the latter is due to the Laplace noise mechanism with $D$-dimensional
features. By choosing a large enough batch size
$b$, the impact of sampling noise and Laplace noise can be made
arbitrarily small^. In contrast, the centralized approach has
to add Laplace noise of constant variance to each feature
and perturb labels with a constant probability (see Appendix).
Regardless of which optimization method is used (SGD or
not), the centralized approach has no means of mitigating the
negative impact of constant noise on the accuracy of the learned
model, which will be especially problematic with a small ε.
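The dependence of the Laplace noise term on the batch size can be checked with a short derivation. The sketch below assumes that the noise vector $z$ has $D$ independent components, each Laplace-distributed with scale $s$, and that the sensitivity of the averaged gradient shrinks as $1/b$ because a single sample contributes only one of the $b$ averaged terms; the symbol $s$ is introduced here for illustration.

```latex
% Assumption: z = (z_1, ..., z_D), z_j ~ Laplace(0, s) independently,
% so that E[z_j] = 0 and E[z_j^2] = Var(z_j) = 2 s^2.
\begin{gather*}
E\big[\|z\|^2\big] = \sum_{j=1}^{D} E\big[z_j^2\big] = 2 D s^2, \\
s = \frac{S(\tilde{g})}{\epsilon_g}, \qquad S(\tilde{g}) \propto \frac{1}{b}
\quad\Longrightarrow\quad
E\big[\|z\|^2\big] \propto \frac{D}{(b\,\epsilon_g)^2}.
\end{gather*}
```

Under these assumptions the privacy noise decays as $1/b^2$, faster than the $1/b$ decay of the sampling noise, so for large minibatches the sampling term dominates.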

In the decentralized approach, a device need not interact 
with a server, and is almost free of privacy concerns. However, 
the increased privacy comes at the cost of performance. In 
Crowd-ML and the centralized approach, samples pooled from 
all devices are used in the learning process, whereas in the 
decentralized approach, each device can use only a fraction 
(≈ 1/M) of samples. This undermines the accuracy of a model
learned by the decentralized approach. For example, it is
known from the VC-theory for binary classification problems
that the upper bound of the estimation error is larger with a
1/M-times smaller sample size [43].

B. Scalability 

Scalability is determined by computation and communi¬ 
cation loads and latencies on both device and server sides. 
We compare these factors between centralized, crowd, and 
decentralized learning approaches. 

1) Computation load: For all three approaches, we assume
the same preprocessing is performed on each device to compute
features from raw sensory input or metadata. On the device
side, the centralized learning approach requires generation
of Laplace noise per sample on the device. The crowd and
the decentralized approaches perform partial and full learning
on the device, respectively, and require more processing.
Specifically, Crowd-ML requires computation of a gradient per
sample, a vector summation (for averaging) per sample, and
generation of Laplace random noise per minibatch. A low-end
smart device capable of floating-point computation can
perform these operations. The decentralized learning approach
can use any learning algorithm, including SGD similar to
Crowd-ML. However, if the decentralized approach is to make
up for the smaller sample size (1/M) compared to Crowd-ML,
it may require more complex optimization methods, which
result in higher computation load. For all three approaches,
the number of devices $M$ does not affect per-device computation
load. Computational load on the server is also different for
these approaches. The centralized approach puts the highest
load on the server, as all computations take place on the
server. In contrast, Crowd-ML puts minimal load on the server,
which is the SGD update (3), since the main computation is
performed in a distributed manner by the devices.

^ although a larger batch size means fewer updates given the same number
of samples $N$, and too large a batch size can negatively affect the convergence
rate (see [42] for discussion).

2) Communication load: To process incoming streams of
data from the devices in time, the network and the server should
have enough throughput. The centralized learning approach
requires $N$ samples to be sent over the network to
the server. For Crowd-ML with a minibatch size of $b$, devices
send $N/b$ gradients altogether, and receive the same number
of current parameters, both of the same dimension as a feature
vector. Therefore, the data transmission is reduced by a factor
of $b/2$ compared to the centralized approach.
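The b/2 reduction factor follows from counting vectors sent over the network. The small helper below makes the arithmetic explicit; the function name is hypothetical, and it assumes, as in the text, that gradients and parameters have the same dimension as a feature vector and that the number of samples is a multiple of the batch size.

```python
def transmitted_vectors(n_samples, batch_size):
    """Return (centralized, crowd_ml) counts of feature-sized vectors transmitted."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    centralized = n_samples                 # every sample is uploaded
    checkins = n_samples // batch_size      # one gradient per minibatch
    crowd_ml = 2 * checkins                 # gradient up + parameters down
    return centralized, crowd_ml
```

For example, with 1000 samples and a minibatch of 20, the centralized approach sends 1000 vectors while Crowd-ML exchanges 100, a ratio of 10 = b/2.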

3) Communication latency: When using a public (and
mobile) network, latency is non-negligible. In the centralized
approach, latency may not be an issue, since the server is
not required to send any real-time feedback to the devices. In
Crowd-ML, latency is an issue that can affect its performance.
There are three possible delays that add up to the overall
latency of communication:

• Request delay ($\tau_{req}$): time from the check-out request
from a device until the receipt of the request at the server

• Check-out delay ($\tau_{co}$): time from the receipt of a request
at the server until the receipt of the parameters at the device

• Check-in delay ($\tau_{ci}$): time from the receipt of the parameters
at the device until the receipt of the check-ins at the
server

Due to delays, if a device checks out the parameter $w$ at time
$t_0$ and checks in the gradient $g$, and the server receives $g$
at time $t_0 + \tau_{co} + \tau_{ci}$, the server may have already updated
the parameters $w$ multiple times using the gradients from
other devices received during this time period. This number
of updates is roughly $(\tau_{co} + \tau_{ci}) \times M F_s / b$, where $M$ is the
number of devices, $F_s$ is the data sampling rate per device, and
$1/b$ is the reduction factor due to minibatch. Again, choosing
a large batch size $b$ relative to $M F_s$ can reduce the impact of latency.
While exact analysis of the impact of latency is difficult, there are
several related results known in the literature without considering
privacy. Nedic et al. [44] proved that delayed asynchronous
incremental update converges with probability 1 to an optimal
value, assuming a finite maximum latency. Recent work in
distributed incremental update [20], [21] also shows that a
near-optimal convergence rate is achievable despite delays. In
particular, Dekel et al. [21] show that delayed incremental
updates are scalable with $M$ by adapting the minibatch size.
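The estimate of how many server updates occur between a device's check-out and check-in can be evaluated directly. The function below is a minimal sketch of that formula; its name and the example numbers in the usage note are illustrative assumptions, not measurements from the paper.

```python
def expected_stale_updates(tau_co, tau_ci, num_devices, sampling_rate_hz, batch_size):
    """Approximate number of parameter updates during one check-out/check-in round trip.

    Implements (tau_co + tau_ci) * M * F_s / b with delays in seconds and F_s in Hz.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return (tau_co + tau_ci) * num_devices * sampling_rate_hz / batch_size
```

With hypothetical values of 0.5 s for each delay, 1000 devices, one sample every 30 s, and a minibatch of 10, the estimate is about 3.3 intervening updates; doubling the minibatch size halves it.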




V. Evaluation 


In this section, we describe a prototype of Crowd-ML
implemented on off-the-shelf Android phones and activity
recognition experiments on Android smartphones. We also perform
digit and object recognition experiments under varying
conditions in simulated environments and demonstrate the
advantages of Crowd-ML analyzed in Section IV.


A. Implementation 

We implement a Crowd-ML prototype with three components:
a Web portal, commercial off-the-shelf smart devices,
and a central server. On the device side, we implement
Algorithm 1 on commercial off-the-shelf smartphones as an
app using Android OS 4.3+. Our prototype uses smartphones,
but can be easily ported to other smart device platforms.
On the server side, we implement Algorithm 2 on a Lenovo
ThinkCentre M82 machine with a quad-core 3.2 GHz Intel
Core i5-3470 CPU and 4 GB RAM running Ubuntu Linux
14.04. The server runs the Apache Web server (version 2.4)
and a MySQL database (version 5.5).

Also on the server side, our Crowd-ML prototype provides
a Web portal over HTTPS where users can browse ongoing
crowd-learning tasks and join them by downloading the app
to their smart devices. To enhance transparency, the details of
tasks (objective, sensory data collected, labels collected, and
learning algorithms used) and our privacy mechanisms are
explained. The portal also displays timely statistics about
crowd-learning applications, such as error rates and activity
label distributions, which are differentially private. We
implement the portal in Python using the Django Web application
framework and Matplotlib for statistical visualization.


B. Activity Recognition in Real Environments 

In this experiment, we perform activity recognition on smart
devices. The purpose of this demonstration is to show Crowd-ML
working in a real environment, so we choose a simple task
of recognizing three types of user activities (“Still”, “On Foot”,
and “In Vehicle”). We install a prototype Crowd-ML application
on 7 smartphones (Galaxy Nexus, Nexus S, and Galaxy
S3) running Android 4.3 or 4.4. The seven smartphones are
carried by college students and faculty over a period of a
few days. The devices' triaxial accelerometers are sampled
at 20 Hz. In this demonstration, we avoid manual annotation
of activity labels to facilitate data acquisition, and instead
use Google's activity recognition service to obtain ground
truth labels. Acceleration magnitudes
$|a| = \sqrt{a_x^2 + a_y^2 + a_z^2}$
are computed continuously over 3.2 s sliding windows. Feature
extraction is performed by computing the 64-bin FFT
of the acceleration magnitudes. We set the sampling rate
$F_s = 1/30$ Hz, that is, a feature vector $x$ and its label $y$
are generated every 30 s. However, to avoid getting highly
correlated samples and to increase the diversity of features, we
collect a sample only when its label has changed from its
previous value. For example, samples acquired during sleep
are discarded automatically as they all have “Still” labels. This
lowers the actual sampling rate to about $F_s = 1/352$ Hz (about
one sample every six minutes). With this low rate, no battery
problem was observed.
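
The feature pipeline described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the app's actual code: it takes the "64-bin FFT" to mean the magnitudes of a 64-point discrete Fourier transform (20 Hz × 3.2 s = 64 samples), it keeps the very first sample because there is no previous label to compare against, and all function names are hypothetical. A naive DFT is used so that the sketch needs only the standard library.

```python
import cmath
import math


def acceleration_magnitudes(samples):
    """Map (ax, ay, az) triples to magnitudes sqrt(ax^2 + ay^2 + az^2)."""
    return [math.sqrt(ax * ax + ay * ay + az * az) for ax, ay, az in samples]


def dft_magnitudes(signal):
    """Magnitude of every bin of the discrete Fourier transform (naive O(n^2))."""
    n = len(signal)
    bins = []
    for k in range(n):
        total = sum(
            value * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, value in enumerate(signal)
        )
        bins.append(abs(total))
    return bins


def window_features(samples, rate_hz=20, window_s=3.2):
    """Features of the most recent window: 64 DFT magnitudes with the defaults."""
    window_len = round(rate_hz * window_s)  # 64 samples at 20 Hz over 3.2 s
    if len(samples) < window_len:
        raise ValueError("not enough samples to fill one window")
    return dft_magnitudes(acceleration_magnitudes(samples[-window_len:]))


def keep_on_label_change(labeled_features):
    """Keep only (features, label) pairs whose label differs from the previous one.

    Assumption: the first pair is kept, since it has no predecessor.
    """
    kept, previous = [], None
    for features, label in labeled_features:
        if label != previous:
            kept.append((features, label))
        previous = label
    return kept


if __name__ == "__main__":
    resting = [(0.0, 0.0, 9.8)] * 64  # a phone lying still: gravity only
    features = window_features(resting)
    print(len(features), round(features[0], 1))
```

For a device at rest, all the energy lands in the zero-frequency bin, so such windows look nearly identical, which is consistent with the motivation for discarding runs of samples that share a label.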

We use 3-class logistic regression with $\lambda = 0$, $b = 1$,
$\epsilon^{-1} = 0$, and a range of $\eta$ values. Repeated experiments
with different parameters are time-consuming, and we leave
the full investigation to the second experiment in a simulated
environment. In Fig. 3, we show the collective error curves for
the first 300 samples from the 7 devices. The error is a
time-averaged misclassification error as the learning progresses:
$\mathrm{Err}(t) = \frac{1}{t} \sum_{i=1}^{t} I[y_i \neq \hat{y}_i]$.
The error curves for different learning rates $\eta$ are very
similar, and virtually converge after only 50 samples (about 7
samples per device). This experiment is a proof of concept that
Crowd-ML can learn a common classifier fast, from only a small
number of samples per user.

[Plot omitted; vertical axis: prediction error.]

Fig. 3: Time-averaged error across all devices for activity
recognition task.
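
As a small illustration of the metric, the sketch below computes a running misclassification rate, reading the time-averaged error as the fraction of the first $t$ samples that were misclassified at the time they arrived. The function name and the toy label sequence are assumptions made for this example.

```python
def time_averaged_error(true_labels, predicted_labels):
    """Return [Err(1), ..., Err(T)], the running misclassification rate."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    errors, mistakes = [], 0
    for t, (y, y_hat) in enumerate(zip(true_labels, predicted_labels), start=1):
        mistakes += int(y != y_hat)  # indicator I[y != y_hat]
        errors.append(mistakes / t)
    return errors


if __name__ == "__main__":
    truth = ["Still", "On Foot", "In Vehicle", "Still"]
    guess = ["Still", "Still", "In Vehicle", "Still"]
    print(time_averaged_error(truth, guess))
```

Because each prediction is made before the sample is used for an update, a curve of this kind measures how quickly the shared classifier becomes useful, not only its final accuracy.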


C. Digit/Object Recognition in Simulated Environments 

To evaluate Crowd-ML under various conditions, we perform
a series of experiments on handwritten digit recognition
and visual object recognition. Since the two sets of results are
quite similar, we describe only the digit recognition results (the
object recognition results are in Appendix D). The MNIST dataset
consists of 60000 training and 10000 test images of handwritten
digits (0 to 9) and is a standard benchmark dataset for
learning algorithms. The task is to classify a test image as
one of the 10 digit classes. The MNIST images are
preprocessed with PCA to a reduced dimension of 50
and are L1-normalized. In this experiment, we compare the
performance of the centralized, Crowd-ML, and decentralized learning
approaches using the same data and classifier (multiclass
logistic regression), under different conditions such as privacy
level $\epsilon$, minibatch size $b$, and delays. To test the
algorithms with full control of the parameters, we run them in
a simulated environment instead of on a real network. We
can therefore choose the number of devices and maximum
delays arbitrarily. For simplicity, we set
$\tau = \tau_{req} = \tau_{co} = \tau_{ci}$ (Section IV-B3).
Here $\tau$ is the maximum delay, and the actual
delays are sampled uniformly at random from $[0, \tau]$ for
each communication instance.


Django: http://www.djangoproject.com
Matplotlib: http://matplotlib.org

MNIST: http://yann.lecun.com/exdb/mnist/

Delays can also be drawn from distributions other than the uniform distribution.













All results in this section are test errors averaged over 10
trials. For each trial, the assignment of samples, the order of
devices, the perturbation noise, and the amounts of delay are
randomized. Test errors are computed as functions of the iteration
(the number of samples used), up to five passes through the data.
Hyperparameters $\lambda$ and $c$ are selected based on the
test error averaged over 10 trials. We set the number of devices
$M = 1000$. Consequently, each device has 60 training and 10
test samples on average.

Fig. 4 compares the performance of the centralized, crowd,
and decentralized learning approaches, without privacy or
delay ($\epsilon^{-1} = 0$, $b = 1$, $\tau = 0$). The error of
centralized batch training is the smallest (0.1), in a tie with
Crowd-ML: the error curve of Crowd-ML converges to the same low
value as the centralized approach. This shows that incremental
update by SGD in Crowd-ML is as accurate as batch learning when
privacy and delay are not considered. In contrast, the error
curve of the decentralized approach converges at a slower rate
and to a higher error ($\approx 0.5$), despite using the
same overall number of samples as the other algorithms, due to
the lack of data sharing.



Fig. 4: Comparison of test error for centralized, crowd, and
decentralized learning approaches, without delay or privacy
consideration. The curves show how the error decreases as the
number of iterations (the number of samples used) increases over
time. The batch algorithm is not incremental, so its error is
shown as a constant.


We perform tests with varying levels of privacy $\epsilon$.
Privacy affects the centralized approach via (15) and (16),
and Crowd-ML via its gradient and count perturbation. With low
privacy ($\epsilon^{-1} \to 0$), the performance of both the
centralized and crowd approaches is almost the same as in Fig. 4,
and we omit the result. With high privacy ($\epsilon \to 0$), the
performance of both approaches degrades to an unusable level.
Here we show their performance at $\epsilon^{-1} = 0.1$ in Fig. 5,
where the performance is in a transition state between the
high- and low-privacy regions. First, the centralized and crowd
approaches both perform worse than they did in Fig. 4, which is
the price of privacy preservation. Among these results, Crowd-ML
with a minibatch size $b = 20$ has the smallest asymptotic error,
much below that of the centralized batch approach.
Crowd-ML with $b = 1$ and $b = 10$ still achieves similar or better
asymptotic error compared to Central (batch). As predicted
in Section IV, increasing the minibatch size improves the
performance of Crowd-ML. When SGD is used for the centralized
approach (Central SGD) with perturbed features and labels, its
performance is very poor ($\approx 0.9$) regardless of the minibatch
size, due to the larger noise required to provide the same level
$\epsilon$ of privacy as Crowd-ML. (The features and labels for
test data are not perturbed.)
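
One way to see the role of the minibatch size is to compare per-coordinate Laplace noise scales, using the sensitivities derived in the Appendix: $4/b$ for an averaged gradient and 2 for a raw feature vector. The sketch below is a back-of-the-envelope illustration only. It assumes the standard Laplace-mechanism relation scale = sensitivity / $\epsilon$, that nearly the whole Crowd-ML budget goes to the gradient, and that the centralized approach uses $\epsilon_x = \epsilon/2$; it ignores label perturbation entirely, and the function names are hypothetical.

```python
def crowd_gradient_noise_scale(epsilon, batch_size):
    """Laplace scale for an averaged gradient with L1 sensitivity 4 / b."""
    if epsilon <= 0 or batch_size < 1:
        raise ValueError("epsilon must be positive and batch_size at least 1")
    return (4.0 / batch_size) / epsilon


def central_feature_noise_scale(epsilon):
    """Laplace scale for raw features: sensitivity 2 and budget eps_x = eps / 2."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return 2.0 / (epsilon / 2.0)


if __name__ == "__main__":
    epsilon = 10.0  # corresponds to 1/epsilon = 0.1 in the experiments
    print("central features:", central_feature_noise_scale(epsilon))
    for b in (1, 10, 20):
        print(f"crowd gradient, b={b}:", crowd_gradient_noise_scale(epsilon, b))
```

Under these assumptions the gradient noise shrinks in proportion to $1/b$, while the feature noise in the centralized approach does not depend on $b$ at all.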



[Plot omitted; horizontal axis: iteration. Curves: Central (SGD) and
Crowd-ML (SGD), each with b = 1, 10, 20, and Central (batch).]

Fig. 5: Comparison of test error for centralized and crowd
learning approaches with privacy ($\epsilon^{-1} = 0.1$), varying
minibatch sizes ($b$), and no delay.


Lastly, we look at the impact of delays on Crowd-ML with
privacy $\epsilon^{-1} = 0.1$. We test with different delays in the
unit of $\Delta = \tau/(M F_s)$, that is, the number of samples
generated by all devices during the delay of size $\tau$. In Fig. 6,
we show the results with two minibatch sizes ($b = 1, 20$) and
varying delays ($1\Delta$, $10\Delta$, $100\Delta$, $1000\Delta$).
The delay of $1000\Delta$ means that a maximum of $3 \times 1000$
samples are generated among the devices between the time a single
device requests a check-out from the server and the time the
server receives the check-in from that device, which is quite
large. Fig. 6 shows that an increase in the delay somewhat slows
down the convergence with a minibatch size of 1, and the converged
error is similar to or worse than that of Central (batch). However,
it also shows that with a minibatch size of 20, delay has little
effect on the convergence, and the error is much lower than that of
Central (batch). Note that with the minibatch size of 20, there is a
small plateau at the beginning of the error curves, reflecting the
fact that the devices are initially waiting for their minibatches
to be filled before computing begins. After this initial waiting
time, the error starts to decrease at a fast rate.
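
The delay model used in this experiment can be mimicked with a short simulation. The sketch below rests on assumptions that are not spelled out in the text: it measures delay in units in which one sample is generated across all devices, it treats a round trip as three independent legs (request, check-out, check-in), each uniform on $[0, \tau]$, and it counts one gradient per $b$ samples. The function names are hypothetical.

```python
import random


def sample_round_trip_delay(max_delay_units, rng):
    """Sum of three independent delays, each uniform on [0, max_delay_units]."""
    return sum(rng.uniform(0.0, max_delay_units) for _ in range(3))


def stale_minibatches(max_delay_units, batch_size, rng):
    """Minibatch gradients other devices may have sent during one round trip."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    samples_elsewhere = sample_round_trip_delay(max_delay_units, rng)
    return int(samples_elsewhere // batch_size)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    for b in (1, 20):
        draws = [stale_minibatches(1000, b, rng) for _ in range(5)]
        print(f"b={b}: intervening gradients in five round trips: {draws}")
```

With a maximum delay of 1000 units, a round trip never spans more than 3000 samples, matching the $3 \times 1000$ bound quoted above.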



[Plot omitted; horizontal axis: iteration. Curves: Crowd-ML with
b = 1 and b = 20 at delays of 1, 10, 100, and 1000 units of
$\Delta$, and Central (batch).]

Fig. 6: Impact of delays on Crowd-ML with privacy
($\epsilon^{-1} = 0.1$), varying minibatch sizes, and varying delays.


VI. Conclusion 

In this paper, we proposed Crowd-ML, a machine learning
framework for a crowd of smart devices. Compared to
previous crowdsensing systems, Crowd-ML is a framework
that integrates sensing, learning, and privacy mechanisms,
and can build classifiers or predictors of interest
from crowdsensing data using the computing capability of smart
devices. Algorithmically, Crowd-ML utilizes recent advances
in distributed and incremental learning, and implements strong
differentially private mechanisms. We analyzed Crowd-ML
and showed that it can outperform centralized approaches
while providing better privacy and scalability, and
can also take advantage of larger shared data, which
decentralized approaches cannot. We implemented a prototype of
Crowd-ML and evaluated the framework with a simple activity
recognition task in a real environment as well as larger-scale
experiments in simulated environments, which demonstrate
the advantages of the design of Crowd-ML. Crowd-ML is a
general framework for a range of different learning algorithms
with crowdsensing data, and is open to further refinements for
specific applications.


Appendix 

A. Proof of Theorem 1

In our algorithms, a device receives $w$ from the server and
sends averaged gradients $g$ along with other information. We
assume $\|x\|_1 \le 1$, which can be easily achieved by normalizing
the data. The sensitivity of an averaged gradient for logistic
regression is $4/b$, as shown below. There are $C$ parameter
vectors $w_1, \ldots, w_C$ for multiclass logistic regression. Let the
matrix of gradient vectors corresponding to the $C$ parameter
vectors be

$$g = [g_1\ g_2\ \cdots\ g_C] = x\,[P_1\ \cdots\ P_y - 1\ \cdots\ P_C] + \lambda\,[w_1\ \cdots\ w_C] = xM + \lambda\,[w_1\ \cdots\ w_C],$$

where $P_j = P(y = j \mid x; w)$ is the posterior probability, and $M$
is a row vector of $P_j$'s. Without loss of generality, consider
two minibatches $\mathcal{D}$ and $\mathcal{D}'$ that differ in only
the first sample $x_1$. The difference of averaged gradients
$g(\mathcal{D})$ and $g'(\mathcal{D}')$ is

$$\|g - g'\|_1 \le \frac{1}{b}\left(\|x_1 M_1\|_1 + \|x_1' M_1'\|_1\right) \le \frac{4}{b}.$$

To see $\|M_1\|_1 \le 2$, note that the absolute sum of the entries of
$M_1$ is $2(1 - P_{y_1}) \le 2$. The sensitivity of multiple minibatches
$g(1), \ldots, g(T)$ is the same as the sensitivity of a single $g(t)$,
and the $\epsilon$-differential privacy follows from Proposition 1 of
[?].

B. Proof of Theorem 2

In addition to the averaged gradients, a device sends to the
server the number of samples $n_s$, the number of misclassified
samples $n_e$, and the label counts $n_y$. Perturbation by adding
discrete Laplace noise is equivalent to random sampling by the
exponential mechanism [44] with
$P(\hat{n}_e \mid n_e) \propto \exp\left(-\frac{\epsilon_e}{2}|\hat{n}_e - n_e|\right)$,
$\hat{n}_e \in \mathbb{Z}$. If two datasets $\mathcal{D}$ and
$\mathcal{D}'$ differ in only one item, then the score function
$d = -|\hat{n}_e - n_e|$ changes by at most 1. That is,
$\max_{\mathcal{D}, \mathcal{D}'} |d(\hat{n}_e, n_e(\mathcal{D})) - d(\hat{n}_e, n_e(\mathcal{D}'))| = 1$.
As with multiple gradients, the sensitivity of multiple sets
of $(\hat{n}_e, \hat{n}_y)$ is the same as the sensitivity of a single
set, and $\epsilon_e$-differential privacy follows from Theorem 6 of
[44]. The proof of $\epsilon_{y^k}$-differential privacy of
$\hat{n}_y$ is similar.

Remark 1: Unlike the gradient $g$, the information ($n_s$,
$\hat{n}_e$, $\hat{n}_y$) is not required for learning itself, but for
monitoring the progress of each device on the server side.
Therefore, $\epsilon_e$ and $\epsilon_{y^k}$ can be set to be very small
without affecting the learning performance, so that
$\epsilon = \epsilon_g + \epsilon_e + C\epsilon_{y^k} \approx \epsilon_g$.

Remark 2: $\hat{n}_e$ and $\hat{n}_y$ can be negative with a small
probability, but this has a limited effect on the estimates of the
error rate and the prior at the server. After receiving $T$
minibatches, the error rate and the prior estimates are

$$\mathrm{Err}^{\mathrm{est}} = \frac{\sum_{i} \hat{n}_e(i)}{\sum_{i} n_s(i)} \quad \text{and} \quad P^{\mathrm{est}}(y = k) = \frac{\sum_{i} \hat{n}_y^k(i)}{\sum_{i} n_s(i)}. \qquad (14)$$

Since $\hat{n}_e(i) - n_e(i)$ is independent for $i = 1, \ldots, T$ and
has zero mean and constant variance
$\frac{2e^{-\epsilon_e/2}}{(1 - e^{-\epsilon_e/2})^2}$, the estimate of
the error rate converges almost surely to the true error rate with
vanishing variance as $T$ increases. The same can be said of
the estimate of the prior $P(y)$.

C. Differential Privacy in Centralized Approach

For completeness, we also describe the $\epsilon$-differential
privacy mechanisms for the centralized approach.
In the centralized approach, data are sent directly to the server.
Without a privacy mechanism, an adversary can potentially
observe all data. To prevent this, $\epsilon$-differential privacy
can be enforced by perturbing the features

$$\hat{f}(x) = x + z, \quad P(z) \propto e^{-\frac{\epsilon_x}{2}\|z\|_1} \qquad (15)$$

and also perturbing the labels. To perturb labels, we use the
exponential mechanism to sample a noisy label $\hat{y}$ given a true
label $y$ from

$$P(\hat{y} \mid y) \propto e^{\frac{\epsilon_y}{2} d(y, \hat{y})}, \quad \hat{y} \in \{1, \ldots, C\} \qquad (16)$$

where we use the score function $d(y, \hat{y}) = I[y = \hat{y}]$.

Theorem 3 (Feature and label perturbation). The transmission
of $x$ and $y$ by feature perturbation (15) and the exponential
mechanism (16) is $\epsilon_x$- and $\epsilon_y$-differentially private.

Proof: Assume $\|x\|_1 \le 1$. Feature transmission is an
identity operation and therefore has sensitivity 2. For label
transmission, the score function $d(y, \hat{y}) = I[y = \hat{y}]$
changes by at most 1 when $y$ changes. From Proposition 1 of [?]
and Theorem 6 of [44], respectively, we achieve $\epsilon_x$- and
$\epsilon_y$-differential privacy of data. ■

Note that the sensitivity is independent of the number of
features and labels sent, and we have to add the same level
of independent noise to the features and apply the same
amount of label perturbation. An overall $\epsilon$-differential
privacy is achieved by $\epsilon = \epsilon_x + \epsilon_y$. The
required privacy levels $\epsilon_x$ and $\epsilon_y$ can be chosen
differently, and we use $\epsilon_x = \epsilon_y = \epsilon/2$ in the
experiments.
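
The perturbation mechanisms used in the appendix can be sketched with the standard library alone. This is an illustrative sketch, not the authors' implementation: it assumes that the feature noise in (15) factorizes into independent Laplace variables of scale $2/\epsilon_x$ per coordinate, that the label mechanism (16) uses weight $e^{\epsilon_y/2}$ for the true label and 1 for every other label, and that the discrete Laplace noise on counts has $P(k) \propto e^{-\epsilon|k|/2}$. All function names are hypothetical.

```python
import math
import random


def sample_laplace(scale, rng):
    """Zero-mean Laplace sample: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_features(x, eps_x, rng):
    """Add independent Laplace noise of scale 2 / eps_x to every feature."""
    scale = 2.0 / eps_x
    return [value + sample_laplace(scale, rng) for value in x]


def label_probabilities(y, num_classes, eps_y):
    """P(y_hat | y) proportional to exp(eps_y / 2 * I[y_hat == y]), labels 1..C."""
    weights = [
        math.exp(eps_y / 2.0) if k == y else 1.0
        for k in range(1, num_classes + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def perturb_label(y, num_classes, eps_y, rng):
    """Sample a noisy label from the exponential mechanism."""
    probs = label_probabilities(y, num_classes, eps_y)
    return rng.choices(range(1, num_classes + 1), weights=probs, k=1)[0]


def sample_discrete_laplace(eps, rng):
    """Integer noise with P(k) proportional to exp(-eps / 2 * |k|)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    p = math.exp(-eps / 2.0)
    if p == 0.0:  # eps so large that the noise is zero in floating point
        return 0

    def geometric():
        # Inverse-transform sample of P(G = k) = (1 - p) * p**k for k >= 0.
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        return int(math.floor(math.log(u) / math.log(p)))

    # The difference of two i.i.d. geometric variables is discrete Laplace.
    return geometric() - geometric()


if __name__ == "__main__":
    rng = random.Random(0)
    print(perturb_features([0.5, 0.5], eps_x=5.0, rng=rng))
    print(perturb_label(3, num_classes=10, eps_y=5.0, rng=rng))
    print(10 + sample_discrete_laplace(1.0, rng))  # a noisy count
```

Because the discrete Laplace noise is symmetric around zero, a noisy count can fall below zero when the true count is small, which is the situation Remark 2 addresses.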










D. Experiments with Visual Object Recognition Task 


We repeat the experiments in Section V-C for an object
recognition task using the CIFAR-10 dataset, which consists of
images of 10 types of objects (airplane, automobile, bird,
cat, deer, dog, frog, horse, ship, truck) collected by [45].
We use 50,000 training and 10,000 test images from CIFAR-10.
To compute features, we use a convolutional neural
network trained on the ImageNet ILSVRC2010 dataset,
which consists of 1.2 million images of 1000 categories. We
apply CIFAR-10 images to the network and use the
4096-dimensional output from the last hidden layer of the network
as features. Those features are preprocessed with PCA to a
reduced dimension of 100 and are L1-normalized. We use the
same setting as in Section V-C to test Crowd-ML on this object
recognition task. The results are given in the figures below. The
figures are very similar to the corresponding figures for the
handwritten digit recognition task, except that the error is larger
(e.g., 0.3 in Fig. 7) than the error for digit recognition (0.1 in
Fig. 4). This is because the CIFAR dataset is more challenging than
MNIST due to variations in color, pose, viewpoint, and background
of the object images.



[Plot omitted; horizontal axis: iterations. Curves: Decentral (SGD),
Crowd-ML (SGD), and Central (batch).]

Fig. 7: Comparison of test error for centralized, crowd, and
decentralized learning approaches, without delay or privacy
consideration.


[Plot omitted; vertical axis: test error. Curves: Central (SGD) and
Crowd-ML (SGD), each with b = 1, 10, 20, and Central (batch).]

Fig. 8: Comparison of test error for centralized and crowd
learning approaches with privacy ($\epsilon^{-1} = 0.1$), varying
minibatch sizes ($b$), and no delay.
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